Intro

Column

Column

This report reviews all of the statistical modeling and analysis results on data concerning various restaurants around the world. The purpose of this study is to discover and understand the root cause of what makes a successful restaurant, successful. The primary interest is to distinguish variables that have significance in regard to restaurant ratings and why that may be, in terms of various variables such as the location and status of the restaurant in question. Additionally, this report is designed to help visualize the trends in what it takes to encourage and retain high ratings within this industry.

In this study we handle a large dataset with over 10,000 entries, which we have attained from Kaggle. The dataset was originally acquired by Mehta in*** with the help of the Zomato API.

From a consumer’s perspective, there are many factors that can differentiate a restaurant from being average, to one that will be remembered for being great. We hypothesize that most measures provided within our dataset, such as the location, costs, cuisines, votes, table booking and delivery options, will generally have a significant impact when held against restaurant ratings. There have been many studies conducted in the past, that concerns how online reviews have the power to improve the value of a restaurant. In specific, the study lead by Anderson and Magruder (2011) discovered that a restaurant’s outlook can easily be altered, with the help of online reviews. The pair had focused on the impact of positive online interactions and found that social media and online forums heavily impacted on how consumers judged the quality and services of restaurants. This leads us to believe that the variable Votes, within our dataset, may be highly significant when compared to ratings. This would suggest the level of popularity, which would influence improvements within the many attributes of restaurants. Another study, conducted by a group of students from the Vanderbilt University (2004), suggested that hygiene of restaurants has a high significance level, pertaining to restaurant success. This would suggest us to believe that location, in specific, is likely to impact restaurant ratings.

Tidying Data

Loading Files From Kaggle Source - Loading all libraries and files required for this analysis.

#------LOAD DATA------------
library(tidyverse)
library(readxl)
library(leaflet)
library(gridExtra)
library(broom)
library(plotly)
library(gridExtra)

#------LOAD FILES---------

CCodesfile="~/project/Country-Code.xlsx"
Zomatofile="~/project/zomato.xlsx"
CurrencyFile="~/project/CurrencyRates.xlsx"

CCodes=read_excel(CCodesfile) #importing country code file
zomato=read_excel(Zomatofile) #importing main zomato file
Currency=read_excel(CurrencyFile) #importing currency file

The original data set contained varying currencies per country, making it difficult to compare each restaurant uniformly. We created a seperate file containing the United States Dollar ($USD) conversion rates for each country, which will be used to convert current currency figures into new standard ones.

Data Wrangling - Although the data is presented pre-cleaned into appropriate columns, there still remains some level of manipulation that needs to be done.

#------ GET RID OF SOME COLUMNS & ADD NEW ONES -------------
zomato=zomato %>% select (-c(Rating_Color,Switch_To_Order_Menu,Address,Locality_Verbose,Currency))
#we remove currency column as it is incorrect and contains too many unique symbols

#add in new columns: Country Names according to Country Code & proper currency names with conversion rates
zomato = zomato %>% left_join(CCodes) 
zomato = zomato %>% left_join(Currency) 

#add in new column: USD Cost (for equal comparison)
zomato = zomato %>% mutate(Avg_Cost_USD = Average_Cost_For_Two*`Conversion Rate (USD)`) 
#Multiplied Currency Column by it's conversion rate

#replacing Cuisines col. with Principal_Cuisines (the first category of cuisines for every country row)
zomato= zomato %>% separate(Cuisines,into=c("Principal_Cuisines"),sep=',') #store "primary" cuisine types into new column and replace old with this

#Make sure rating categories are ordered:
zomato = zomato %>% mutate(Rating_Text=ordered(Rating_Text,levels=c("Excellent","Very Good","Good","Average","Poor")))

We begin by removing unnecessary columns and adding in necessary ones:

  1. Add in countries by country code via the country code file provided by kaggle

  2. Add in our own currency conversion file, allowing us to convert average cost for two into the USD amount and storing it in it’s own column called “Avg Cost USD”.

We notice there were many cuisines listed per restaurant, so to simplify things for analytic purposes, we extract the first cuisine listed by each restaurant.

Removing Missing Values - Since we will focusing on the ratings of restaurants, we remove restaurants with missing aggregate rating values.

#---------REMOVE MISSING VALUES----------
#we notice that there are several "0" rated stores for even expensive places so we decided to remove no rating stores
zomato[zomato == 0] = NA #remove values that have zero
zomato[zomato == "Not rated"] = NA #remove unrated values
zomato = zomato %>% filter(!is.na(Aggregate_Rating))

india.data = zomato %>% filter(Country == "India")
us.data = zomato %>%  filter(Country== "United States")

Exploratory Analysis

Distribution Of Variables

* Distribution of Rating Text & Aggregate Rating** After plotting the distribution of Rating Text, we see that majority of the restaurants from our data is rated ‘Average’ and the number of Restaurants decrease, almost exponentially, as the rating gets better. This is not surprising as there should only be a limited amount of ‘Excellent’ and ‘Very Good’ restaurants, since there is often a differentiating factor between good and great places that only a few can achieve. We notice that this is slightly right skewed but that seems to make sense, given the context behind it.

To visualize the distribution of Aggregate Rating, we decided to generate a histogram. We notice here that the distribution of this variable is relatively normal, with most of the Ratings ranging from three to four. This would suggest that on average, most restaurants in our study are rated relatively well by their consumers. There is a bit of a high bump around 3.84; however, this is not much of a setback since we are working with a relatively large dataset

Both of these variables measure the success factor of restaurants in this study. It is clear that the distributions of the two are similar but not the exact same. Aggregate Rating seems to be more normally distributed than Rating Text, most likely because it is a more detailed measure.

* Distribution of Average Cost USD** Here we focus on the Average Costs of restaurants. It is clear to see how severely right skewed the distribution is. This helps us understand that almost all restaurants, other than the few outliers to the right, serve at a relatively affordable price that would range roughly beneath 74 USD. If we take a closer look at the costs, located within the tail of this histogram, we notice that they are all within the hundreds. Most of these restaurants are located in more developed countries, such as the United Kingdom, the United States, or countries that have high costs of living, such as Qatar, UAE, and Singapore. Interestingly enough, the highest outlier is just under 360 USD, and the restaurant in question is located in Singapore, a country with one of the highest GDP per capita in the world.

Distribution of Price Range Price Range seems to behave in a very similar way to Average Costs, which also suggests that most places are more affordable than not, leaving only a few choices ranging in higher prices.

* Distribution of Votes** Just like Average Costs, the distribution pertaining to the frequency of Votes each restaurant accumulated, is also heavily skewed to the right. Most of the Votes seem to be under a frequency of 100. It is ironic that both the lowest and highest Vote counts of 4 and 10930, were collected from restaurants in India. This may be due to the wide scale of information we have from India, which would allow for such diverse popularity levels within restaurants.

* Distribution of Table Booking, Online Delivery and Delivering Now** All three variables of, Table Booking, Online Delivery and Delivering Now, has a similar characteristic, which states that there are only a few restaurants that offer these options to their consumers. It would be interesting to see how these options effect aggregate rating, if they have any effect at all.

* Distribution of Locality** Once again, we notice right away that the most dominant country within this distribution is India, which unfortunately serves as one of the limitations of this dataset.

* Distribution of City** Clearly the most dominant city within our dataset is New Delhi, which happens to serve as the capital of India. This is not surprising since our data heavily concerns India. The second most concentrated city is the city of Gurgaon, which is also in India. This is not surprising as majority of this data was attained from India, explaining why this distribution is so heavily asymmetric.


Distribution of Country We decided to produce a histogram to get a better understanding for the assortment of countries, in which our data is based on. It was surprising to see that more than 80% of the data was collected from restaurants located in India. We notice that the United States is the second most concentrated place, after India. In comparison to these two countries, the rest of the data is scattered lightly throughout the globe.

How Variables React to Aggregate Rating

***

Aggregate Rating VS Average Cost USD There seems to be a slow upward trend here, suggesting that rating increases as costs goes up. In industries like this, similar trends can be expected as the quality of restaurants heavily depend costs.

Aggregate Rating VS Price Range Similar to Average Costs, Price Range once again is telling a similar story, due to its clear positive slope. There are a few outliers within the Price Range of 3 and 4; however, it would be safe to disregard this when dealing with a large dataset.

***

Aggregate Rating VS Votes The purpose of comparing Votes to Aggregate rating is to see if popularity plays a part in the success of a restaurant. In this case, it does seem to have a moderate impact, as the median levels of ratings seem to gradually increase, as the number of Votes go up. The upward slope hits its maximum point at a rating of 4.7 with a mean of 649 votes, after which we can see a light decrease. There does seem to be a number of outliers but that tends to be common within large sets of data.

Aggregate Rating VS Country We can see here that the medians of all countries, lie above the Aggregate Rating of 3. This would suggest that all countries normally do well, in terms of restaurant ratings. The highest mean Aggregate Rating of 4.5 was attained in the Philippines and the lowest rating of 3.3 was actually from India. This may be due to the fact that there is a higher cost of living in the Philippines than India, causing ratings to rise.


Aggregate Rating VS Table Booking, Online Delivery and Delivering Now Unlike Table bookings, which has an even distribution for both options and displays a positive slope as Aggregate Rating increases, both Online Delivery and Delivering Now options seem to have no effect as Aggregate Rating increases. It seems that the medians for both options within both variables, Online Delivery and Delivering Now, are constant around the Aggregate Rating of 3.4; however, the distributions do differ. We notice that there lie a few outliers, but they can be safely dismissed, given the size of this dataset.

India Compared to the United States - how does India behave in comparison to the other Countries?: Distribution

* Distribution of Rating Text** The distributions of Rating Text for India compared to the United States is interesting, since both are asymmetric but are skewed to different directions. Within India, the data is skewed to the right, whereas in United States it is skewed to the left. We can notice that the most common rating within restaurants in India is “Average”; however, in United States, most restaurants are rated as “Very Good”.

* Distribution of Aggregate Rating** In India, Aggregate Rating is more normally distributed. The larger chunk of restaurants seems to fall within Aggregate Rating ranging from 3 to a bit over 3.5, which seems to align nicely with the distribution of Rating Text. Within the United States we see something similar, where the distribution of Aggregate Rating seems to be parallel to Rating text, once again. We can roughly conclude that in general, restaurants in the United States are rated better than the restaurants in India.

* Distribution of Average Cost USD**


Distribution of City Earlier we discovered that almost 80% of the information within this dataset was collected from India, once again we experience a similar shock by noticing how extremely asymmetric the distribution of City is, in India. Almost 50% of the data collected within India was from its capital, New Delhi. Though we did not have as much information pertaining to the United States, we can notice that the distribution of City, in United States, is a more evenly distributed nonetheless.

India VS United States - Let’s see how the variabbles reacting to Aggregate Rating of India differ from how they react from other countries

* Aggregate Rating VS Average Cost USD** This comparison makes it easy to notice the disadvantage United States has, due to the lack of information, in comparison to India. Taking a look at India first, we notice that under the $12 USD mark, there is a large portion of restaurants that can vary immensely in Aggregate Rating. Meaning that for $12 USD, a consumer can find a place to eat that is very highly rated, poorly rated and everything in between. This is most likely due to the amount of information we have from India. For the United States, we share a similar concept of having a wide range for Aggregate Rating, given Average Cost. The difference with this country is that there is a lacking variety, when focusing on Average Cost, which was mentioned before. Within both countries however, we do see that there is a slightly positive relation between these two variables, suggesting that when Average cost increases, so does Aggregate Rating but the options of restaurants in turn decreases.

* Rating Text VS Price Range** These plots resemble the distributions of the variable, Rating Text, as seen before; however, it is interesting to see that unlike India, United States has a fairly even distribution of Price Range within each rating category. We can see that within the more common ratings of “Good”, “Very Good” and “Excellent”, most restaurants fall in the Price Range of 2, followed by 1, 3 and then 4. In India, we see that the most frequent rating of “Average”, has the highest number of restaurants within the Price Range of 1. This plot lets us see that for both countries, the common Price Ranges, amongst restaurants, are more affordable than not.


Rating Text VS Votes Both India and United States have a positive relationship with Votes, when comparing to Rating Text. This suggests that as the ratings increase, Votes do as well and this can serve as a popularity factor as well. This would mean that more popular destinations have a higher change of being rated better. The differentiating factor within the two countries is that, India seems to have less of an impact than United states. This is because the slope of medians, within the United States, is more dramatic. We also notice that there is a dip between the ratings “Good” and “Poor” within the plot of United States. This may be caused due to there being a lack of information, or the fact that as a developed country, people are more likely to use technology more often to share bad reviews. This would then explain why there are more votes for restaurants that attained poor ratings.

Map of Restaurants

First Stage Analysis

Linear Regression - Model with all the possible predictor variables in our dataset.

  Aggregate Rating
Predictors Estimates CI p
(Intercept) 3.47 3.41 – 3.53 <0.001
Observations 6504
R2 / adjusted R2 0.597 / 0.537

Since we are aware most of our data comes from New Delhi (specifically Connaught Place) and the most common cuisine is North Indian, we set our baseline for City, Locality and Principal Cuisines in respect to this. This is to gain a general insight on how deviating from these popular choices, effects aggregate rating. The first linear regression model includes all of the following variables:

  • Aggregate Rating (response variable)

  • City

  • Principal Cuisine

  • Table Booking Option

  • Online Delivery Option

  • Price Range

  • Average Cost for Two (USD)

This model yields a R squared value of 59.73% and is statistically significant. Included on the left is a table containing all the coefficent estimates for this model.

Drop Test - Checking if any variables can be dropped from the current model

Single term deletions

Model:
Aggregate_Rating ~ City + Locality + Principal_Cuisines + Has_Table_Booking + 
    Has_Online_Delivery + Price_Range + Avg_Cost_USD
                     Df Sum of Sq    RSS   AIC  F value    Pr(>F)    
<none>                            126430 20993                       
City                 19      1092 127522 21011   2.5716 0.0002015 ***
Locality            728     66826 193256 22297   4.1072 < 2.2e-16 ***
Principal_Cuisines   71     15068 141498 21583   9.4956 < 2.2e-16 ***
Has_Table_Booking     1      1298 127728 21058  58.0806 2.932e-14 ***
Has_Online_Delivery   1       150 126580 20999   6.7022 0.0096541 ** 
Price_Range           1         8 126438 20992   0.3380 0.5610341    
Avg_Cost_USD          1      3174 129604 21152 142.0286 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conducting a drop test using F statistic shows we can drop the variable Price Range from our model without it being affected. Thus, we remove this variable and proceed to our diagnostic graph analysis.

Diagnostic Graphs (1) - Checking the model’s residual fitted graph for accuracy of this model


Our residuals seem to be rather random, with the bulk of it being around zero. The fitted line helps illustrate the overall trend of the residuals which in our case seems to fall slightly below zero.

Diagnostic Graphs (2) - Checking normality of residuals via QQ Plot


QQ Plot shows that the center region of the residuals are normal but the lower tail deviates heavily along with a few outliers in the upper tail.

Diagnostic Graphs (3) - Checking for homoscedasticity via Scale Location


We ideally want to observe a rather straight horizontal line for this plot to prove equal variance of errors. Unfortunately, this scale location plot suggests heavy heteroscedasticty.

Diagnostic Graphs (4) - Checking for influential observations via Cook’s Distance


We notice 3 main observations: 1525, 1786 and 3998 are the most influential observations for our regression.

Second Stage Analysis

Transforming Response Variable - We apply our first transformation to the dependent variable and observe its effects.

  Transformed Rating
Predictors Estimates CI p
(Intercept) 0.37 0.34 – 0.39 <0.001
Observations 6504
R2 / adjusted R2 0.662 / 0.611

We add a new transformed variable to dataset called “Transformed Rating”. Since our Aggregate Rating level is limited between the values 0 to 5, we set a restriction that prevents rating estimates to go above or beyond this range. The transformation is as follows: \[log_{10}{\frac{Aggregate Rating}{5-Aggregate Rating}}\]

The transformed linear regression model returns an improved R squared value of 66.2% (a 7% increase from original model).

Drop Test - Checking if any variables can be dropped from the transformed model

Single term deletions

Model:
Transformed_Rating ~ City + Locality + Principal_Cuisines + Has_Table_Booking + 
    Has_Online_Delivery + Avg_Cost_USD
                     Df Sum of Sq   RSS   AIC  F value  Pr(>F)    
<none>                            38947 13333                     
City                 19     257.8 39205 13338   1.9712 0.00710 ** 
Locality            728   27035.9 65983 15306   5.3950 < 2e-16 ***
Principal_Cuisines   71    5385.2 44333 14033  11.0187 < 2e-16 ***
Has_Table_Booking     1    1062.3 40010 13506 154.3239 < 2e-16 ***
Has_Online_Delivery   1      24.1 38971 13335   3.4981 0.06149 .  
Avg_Cost_USD          1    2435.6 41383 13725 353.8218 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conducting a drop test using F statistic shows we can now drop another variable Has Online Delivery from our transformed model without it being affected.

Diagnostic Graphs (1) - Fitting a residuals graph shows random residual points in the transformed model

Diagnostic Graphs (2) - QQ Plot improved from original model but the upper and lower tails deviate from normality.

Diagnostic Graphs (3) - Heteroscedasticity still evident and requires more investigation

Checking Individual Variables (Average Cost for Two) - The linear regression with Average Cost for Two returns relatively random residuals

City Regression - Investigating if the predictor variable City can be transformed to improve the model


At first glance, there seems to be some form of pattern with the residuals for the predictor variable: City. However, this residual plot may require further analysis since it is a categorical variable.

City Regression (2) - Seems to be non-normal distribution of principal cuisines residuals

Locality Regression - The residual plot for the regression of Transformed Rating ~ Locality is random and raises no red flags.

Principal Cuisines Regression - Similar to the regression for City, we experience dips in certain areas of the residual graph

Principal Cuisines Regression (2) - Non-normal distribution of (most) principal cuisine residuals

Booking Regression - The traditional residual graph for the variable: Has Table Booking is hard to interpret as it is a binary variable

Booking Regression (2) - Using boxplots to analyze the residuals for Booking Option variable


Using boxplots we can graph the differences between the residuals and variable values. Notice that the median (and mean) is in the negatives, meaning the mean residuals for this linear regression are not zero. This could be one of the limitations causing our regression’s diagnostic plots to appear distorted.

Statistically Significant Predictors - List of variables that are significantly better or worse than baselines: New Delhi, Connaught Place and North Indian


Please note that because we are working with categorical variables that contained many options, our term estimates are harder to interpret. In this case, because we set our baselines, the coefficent estimates in this table tell us how each term effects transformed rating when compared to only the base term.

Conclusion

The dataset we chose contained recorded information on 9551 different restaurants from around the world. A sample size this large is bound to contain useful information pertaining to restaurant performance.

We began with data visualization in order to let the data paint its own picture and determine what areas are of interest. At first glance we notice that our data is heavily skewed, with more than 80% of the restaurants being located in India. With such a heavy skew, analyzing the data as a whole would be a mistake since the results of the other countries will be influenced by India.

As such, we isolate the restaurants in India from the rest of the world and further investigate. Our key indicator of restaurant performance is its rating measure - both Rating Text and Aggregate Rating - the first allowing for a general overview and the latter for more accurate regression analysis.

After identifying these trends among restaurants in India, we move on to analyze what variables actually affect restaurant performance. We identified a skew distribution in number of votes per restaurant during our exploratory analysis and take this into consideration when running our regression. We use votes as a weight parameter for our regression in order to obtain a better analysis.

Since we are working with a set range of ratings: 0 to 5, we set a limit to restrict regression estimates from going above or below this level and use this as our transformed response variable: \[0<log_{10}{\frac{Aggregate Rating}{5-Aggregate Rating}}<5\]

After running diagnostic plot tests with our regression models we identify the following variables have correlation to restaurant rating:

  1. Average Cost for Two (USD) - Increasing cost results in increased rating
  2. Locality - Aminabad has better restaurant ratings than Connaught Place, while Spot 18 Mall (Pimple Saudagar) is much worse
  3. City - Mysore has restaurants with worse ratings than New Dehli
  4. Booking Option - Restaurants with the option to book are slightly worse off.
  5. Principal Cuisines - No cuisine type is significantly better than North Indian Cuisine

Although we noted the regression results for the variable Principal Cuisines had no statistically significant estimates, it does not mean this variable is insignificant. As determined by our drop test, there does exist some significance for this category. Since our baseline was North Indian, our regression model tells us that there is no significant difference between North Indian vs other cuisines, however, there does exist significant differences between other cuisines. The same can be said for our other two variables Locality and City.

In general, the two variables Average Cost for Two (USD) and Booking Option can estimate the direct effect on restaurant ratings. For every dollar increase in average cost, transformed rating increases by 0.37365388 (holding all else constant) and for restaurants with the option to book ahead of time, surprisingly the transformed rating decreases by 0.09528946.

It is important to note that our selected regression model does not directly return an estimated aggregate rating. Let Y be the estimated value the regression model outputs; in order to undo the transformation on our response variable, we must perform the following: \[Estimated Aggregate Rating = \frac{(5)(10)^{Y}}{1+10^{Y}}\]

Discussion

This study is concerning a very large dataset with over 10,000 entries. Normally this would be associated with many benefits, as it provokes accurate results; however, our greatest problem with this dataset is that it was heavily based in one specific country: India.

This became a limitation when it failed to display a fair comparison of restaurants across the world, as only a limited amount of information was retrieved for all the countries besides India. The uneven assortment of information made it difficult to conclude with the accuracy that was originally intended.

Due to the heavy skewness of the data, it was difficult to fit a model that resulted in desirable diagnostic plots. Given these outliers, our selected regression model returned a randomly scattered residual fitted graph and showed normal distribution of errors for the bulk of the data. However, homogeneity of residual variance is where we run into another limitation. Since there is a lot of variability within our data, certain large outliers tend to skew our scale-location plot.

The other problematic nature of this dataset is that it did not provide other relative measures that would have an impact on restaurant ratings, and what it takes for consumers to enjoy their experiences in such places. In regard to this drawback, we have missed out on variables that may have had a significant impact on our findings. In a future study, these limitations can be avoided by collecting a more detailed selection of data. Working with data that provides a uniform distribution, regarding the sourcing of information, is crucial to statistical studies. In our experience, we were able to compare 2 of the most concentrated countries, India and United States; however, this was not a fair comparison as the amount of information on both these countries were extremely distinctive. We would also be recommended to pay close attention to the variables in question and whether it would be helpful to merge data from different sources, which would likely cover a greater scale of subjects, and thus conclude with even more accurate findings.

In future studies, it will be beneficial to extend the number of variables by including the following:

  1. Quality of Service
  2. Quality of Food
  3. Presentation of Restaurant
  4. Cleanliness

The response variable Aggregate Rating did not explicitly define what aspect of the restaurant it was rating; rather it is assumed to be a general summation of all factors. Therefore, adding specific rating variables such as the ones listed above, will help better understand where these general ratings are coming from.

References

Jones, T. F., Pavlin, B. I., LaFleur, B. J., Ingram, L. A., & Schaffner, W. (2004). Restaurant Inspection Scores and Foodborne Disease. Emerg Infect Dis.

Anderson, M., & Magruder, J. (n.d.). Learning from the Crowd: Regression Discontinuity Estimates of the Effects of an Online Review Database*. The Economic Journal.

A special thanks to Professor Ken Butler for being such a big part of this study and lending us his guidance.